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II.  STATEMENT  OF  THE  PROBLEM  STUDIED 

Quantum  mechanical  methods  based  upon  the  IV-electron  wave  function  are  plagued  by  unfa¬ 
vorable  computational  scaling  with  respect  to  system  size,  and  an  exact  solution  to  the  Schrodinger 
equation  (within  a  given  basis  set)  formally  requires  computational  effort  that  scales  exponentially 
with  system  size.  For  this  reason,  the  full  configuration  interaction  (Cl)  method  is  only  applicable  to 
systems  having  roughly  a  dozen  electrons.  Many  methods  do  scale  more  favorably  with  system  size 
(e.g.  polynomially) ,  but  oftentimes  these  methods  have  been  developed  for  the  accurate  description 
of  only  dynamical  electron  correlation.  The  treatment  of  nondynamical  (or  static)  electron  corre¬ 
lation  is  challenging,  and  often  requires  a  complete-active-space  (CAS)  self-consistent-field  (SCF) 
description  of  some  reference  space.  Unfortunately,  CAS  computations  amount  to  performing  a  full 
Cl  computation  within  the  active  space,  the  cost  of  which  increases  exponentially  with  the  size  of 
the  active  space.  Methods  that  employ  the  two-electron  reduced- density  matrix  (2-RDM)  as  the 
central  variable  (instead  of  the  wave  function)  have  the  potential  to  overcome  this  scaling  problem. 
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Variational  2-RDM  (v-2RDM)  methods1-5  provide  a  reference  independent  description  of  the  elec¬ 
tronic  structure  of  many-electron  systems;  these  methods  also  naturally  capture  static  correlation 
effects.  However,  existing  implementations  of  the  V-2RDM  method  exhibit  a  time-to-solution  that 
is  often  prohibitively  large  for  the  general  description  of  large  molecules  or  materials. 

It  is  the  goal  of  this  STTR  to  develop  an  efficient,  parallel,  and  commercializable  2-RDM- 
based  electronic  structure  software  suite.  In  Phase  I  of  this  STTR,  we  have:  (i)  developed  a 
prototype  commercializable  software  product  to  perform  electronic  structure  computations  using 
the  variational  2-RDM  method,  (ii)  developed  two  semidehnite  programming  solvers  and  identified 
which  is  more  likely  to  be  of  use  for  solving  the  variational  2-RDM  problem  for  general  systems, 
(iii)  interfaced  this  solver  with  the  libraries  of  the  Q-Chem6  electronic  structure  package,  (iv) 
implemented  algorithms  that  use  either  loop-based  or  libtensor-based  tensor  manipulation  and 
identified  the  libtensor  algorithm  as  the  more  promising  for  massively  parallel  variational  2-RDM 
computations,  (v)  developed  a  pilot  distributed-memory  parallel  version  of  the  solver  through  an 
interface  between  libtensor  and  the  Cyclops  Tensor  Framework  (CTF),8  (vi)  enabled  the  automatic 
detection  and  use  of  graphics  processing  units  (GPUs)  for  certain  portions  of  the  algorithm,  and 
(vii)  extended  some  of  our  algorithms  for  the  description  of  open-shell  systems  and  to  include 
partial  three-particle  iV-representability  conditions.5 

III.  SUMMARY  OF  THE  MOST  IMPORTANT  RESULTS 
III  A.  Phase  I  objectives 

Our  stated  objectives  of  Phase  I  were  threefold: 

•  implementing  and  benchmarking  a  boundary-point  semidehnite  programming 
(SDP)  solver9  11  for  the  variational  2-RDM  method.  We  would  compare  the  perfor¬ 
mance  of  this  algorithm  to  the  matrix-factorization-based  algorithm  we  used  to  generate  our 
preliminary  data  in  the  Phase  I  proposal. 

•  interfacing  our  V-2RDM  code  with  libraries  from  the  Q-Chem  electronic  struc¬ 
ture  package6  to  develop  a  new  commercializable  2-RDM-based  software  suite.  A 

particular  emphasis  would  be  placed  on  implementing  the  V-2RDM  method  using  Q-Chem’s 
advanced  tensor  library,  libtensor.7 

•  developing  a  graphics  processing  unit  (GPU)-enabled  and  a  prototype  distributed- 
memory  parallel  variational  2-RDM  algorithm.  The  initial  parallel  implementation 
would  be  acheived  through  an  interface  between  libtensor  and  the  Cyclops  Tensor  Framework 
(CTF).8 

We  describe  below  how  each  of  these  objectives  was  met  during  the  Phase  I  award  period.  We  also 
describe  additional  milestones  achieved  that  were  not  explicitly  stated  as  part  of  the  Phase  I  work. 

Ill  B.  Phase  I  accomplishments 

Objective  1:  Developing  a  boundary  point  semidefinite  programming  solver 

Entering  Phase  1,  we  had  already  developed  a  matrix- factorization-based  semidehnite  program¬ 
ming  solver  for  the  V-2RDM  method  based  on  the  implementation  described  in  Ref.  12.  The  code 
was  implemented  as  a  plugin  to  the  Psi4  electronic  structure  package.13  The  algorithm  minimized 
the  electronic  energy  with  respect  to  the  elements  of  the  2-RDM  subject  to  the  set  of  two-particle 
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IV-representability14  conditions  known  as  the  DQG  (or  PQG)  conditions.15  The  first  goal  of  our 
Phase  I  effort  was  to  implement  a  different  solver,  called  a  boundary-point  semidefinite  solver, 
and  to  benchmark  the  performance  of  this  algorithm  relative  to  the  matrix-factorization-based 
algorithm.  In  the  first  month  of  Phase  I,  we  implemented  the  boundary-point  solver  (again  enforc¬ 
ing  the  DQG  conditions)  as  a  plugin  to  Psi4.  Table  I  illustrates  the  relative  performance  of  the 
boundary-point  and  matrix-factorization  algorithms  for  several  small  closed-shell  molecules  repre¬ 
sented  by  the  STO-3G  basis  set.  The  boundary-point  algorithm  is  clearly  the  superior  algorithm; 


TABLE  I:  Total  time  (s)  required  for  variational  2-RDM  optimizations  using  two  different 

semidefinite  programming  algorithms. 


molecule 

time  (s) 

speedup 

matrix  factorization 

boundary  point 

Be 

15 

1 

15 

LiH 

2506 

2 

1253 

BH 

12246 

9 

1361 

H20 

992 

4 

248 

n2 

31445 

28 

1123 

HF 

267 

2 

133 

CO 

670 

3 

223 

HCN 

4715 

7 

674 

for  several  molecules,  the  time  to  solution  is  more  than  1000  smaller  than  that  required  by  the 
matrix-factorization-based  algorithm.  Hence,  we  have  focused  all  of  our  subsequent  development 
efforts  on  the  more  promising  boundary-point  algorithm.  We  should  note  here  that  there  are  minor 
technical  differences  in  how  the  2-RDM  is  represented  within  the  boundary-point  and  matrix- 
factorization-based  algorithms.  In  generating  the  data  in  Table  I,  our  boundary-point  algorithm 
used  spin-adapted  basis  functions16  for  the  various  representations  of  the  2-RDM,  and  the  rnatrix- 
factorization-based  algorithm  did  not.  Spin  adaptation  does  boost  the  efficiency  of  the  algorithm, 
but  the  majority  of  the  speedups  presented  in  Table  I  can  be  attributed  to  the  efficiency  of  the 
boundary-point  semidefinite  programming  model  relative  to  that  of  the  matrix-factorization-based 
model. 


Objective  2:  Interfacing  our  variational  2-RDM  code  with  Q-Chem  libraries. 

We  next  interfaced  our  boundary-point  semidefinite  solver  with  the  low-level  libraries  of  the 
Q-Chem  electronic  structure  package.  The  initial  phase  of  this  work  was  completed  in  the  second 
month  of  the  Phase  I  award.  The  V-2RDM  method  can  now  be  invoked  using  either  a  Q-Chem- 
style  input  file  [Fig.  1(a)]  or  using  the  IQmol  molecular  visualizer  [Fig.  1(b)].  To  set  up  a 
V-2RDM  computation  within  IQmol,  a  user  can  construct  a  molecule  using  the  graphical  interface 
and  simply  change  the  default  method  from  HF  ( “Hartree-Fock” )  to  RDM.  Currently,  only  the 
default  options  for  the  V-2RDM  method  are  enabled  through  the  IQmol  interface;  in  Phase  II, 
we  will  modify  IQmol  so  as  to  allow  a  user  to  select  various  2-RDM-related  options  through  pull¬ 
down  menus.  The  options  available  through  the  text  interface  control  various  convergence  criteria 
and  the  positivity  conditions  to  be  enforced  (“DONLY,”  “DQ,”  “DQG,”,  etc.).  Once  a  V-2RDM 
computation  has  completed,  some  of  the  output  generated  can  be  interpreted  and  visualized  by 
IQmol.  For  example,  Figure  2  illustrates  one  of  the  natural  orbitals  (the  second-highest  occupied 
natural  orbital,  or  HONO-1)  for  a  water  molecule  computed  using  the  V-2RDM/STO-3G  method. 
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(a)  $molecule 
0  1 
0 

H  1  1.0 

H  1  1.0  2  104.5 

$end 

$rem 

correlation  RDM 

basis  6-31G 

rdm_positivity  DQG 
rdm_eps_convergence  4 
rdm_e_convergence  4 
rdm_cg_convergence  6 

$end 


a  neutral  singlet 


(b) 


selects  the  variational  2-RDM  method 
the  one-electron  basis  set 
enforces  DQG  two-positivity  conditions 
convergence  in  the  constraint  errors 
convergence  in  the  primal/dual  energy  gap 
convergence  in  the  conjugate  gradient  procedure 


FIG.  1:  Computations  using  the  V-2RDM  method  can  be  specified  using  either  (a)  a  Q-Chem-style 

text  input  file  or  (b)  the  IQmol  molecular  viewer. 


The  initial  implementation  of  the  boundary-point 
V-2RDM  method  interfaced  with  the  Q-Chem  li- 
brares  handled  the  IV-representability  constraints 
and  all  related  tensor  manipulations  explicitly  in 
low-level  C++  code,  therefore  providing  direct  ac¬ 
cess  to  tuning  the  program  for  performance.  Our 
second  implementation  codes  the  IV-representability 
constraints  using  the  high-level  tensor  expression 
language  of  the  libtensor  library.  This  library  was 
designed  for  rapid  development  of  tensor-algebra- 
based  computational  algorithms.  Performance  tun¬ 
ing  of  this  second  implementation  must  be  done  at 
the  level  of  libtensor’s  computational  kernels.  In  or¬ 
der  to  fully  appreciate  the  relative  performance  of 
the  two  implementations,  we  must  first  review  the 
general  structure  of  the  boundary-point  semidehnite  solver  (for  more  details  on  the  semidefinite 
algorithm,  please  see  Refs.  9-11).  Three  kernels  dominate  the  cost  of  the  solver:  (1)  the  action  of 
the  constraint  matrix,  A,  on  the  primal  solution  vector  x  [0(k4)],  (2)  the  action  of  the  transpose 
of  the  constraint  matrix,  AT,  on  the  dual  solution  vector,  y  [0(fc4)],  and  (3)  a  diagonalization  and 
transformation  step  to  update  the  primal  solution  [0(ke)].  Here,  k  represents  the  dimension  of  the 
one-electron  basis  set.  Kernel  3  is  evaluated  on  the  order  of  1000  times  for  a  given  optimization,  and 
kernels  1  and  2  are  typically  evaluated  50,000  to  100,000  times  per  optimization.  Hence,  for  small 
systems,  kernels  1  and  2  tend  to  dominate  the  cost  of  the  computation,  despite  their  fourth-power 
scaling. 

Figure  3  shows  a  timing  comparison  of  the  A  ■  x  +  A1  •  y  and  diagonalization/transformation 
steps  of  the  algorithm  for  increasing  system  size  (the  x-axis  represents  the  log  of  the  total  number 
of  basis  functions).  The  A  •  x  and  Ar  •  y  kernels  were  evaluated  1000  times  and  the  diagonal¬ 
ization/transformation  kernel  was  evaluated  10  times.  This  ratio  of  kernel  calls  reflects  that  of  a 
typical  V-2RDM  optimization.  In  this  log-log  plot  the  slope  of  each  line  corresponds  to  the  power 
of  the  computational  cost  scaling.  The  hand-coded  algorithm  exhibits  correct  scaling  (A;4)  for  all 
problem  sizes.  The  libtensor-based  implementation  is  slower  at  the  onset,  but  becomes  faster  than 


EBB  □  Eg]  BOH  © 


FIG.  2:  The  IQmol  molecular  viewer  can  dis¬ 
play  natural  orbitals  from  the  variational  2-RDM 
method. 
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FIG.  3:  Timing  comparison  of  the  A  ■  x  +  A1  ■  y  and  diagonalization/transformation  steps  of  the 
BPSDP  algorithm.  Two  implementations  of  A  ■  x  +  A1  •  y  are  the  hand- written  code  (blue)  and  using 
libtensor  (red).  The  diagonalization/transformation  step  is  shown  in  green. 


the  hand-written  code  at  about  30  orbitals,  and  significantly  faster  for  larger  problems.  Its  ap¬ 
parent  scaling  of  k 2  indicates  that,  for  problem  sizes  in  the  considered  range,  the  computation 
is  dominated  by  steps  that  have  a  lower  scaling  (and  a  very  large  prefactor)  than  the  algorithm 
as  a  whole;  this  invites  the  opportunity  for  further  performance  enhancements.  The  diagonaliza¬ 
tion/transformation  step  scales  close  to  the  sixth  power  in  problem  size,  as  expected,  and  starts  to 
dominate  the  calculation  with  30  orbitals  and  larger. 


Objective  3:  Node-level  and  distributed-memory  parallelism 

We  now  consider  the  shared-memory  parallel  scaling  of  the  A  ■  x,  A1  •  y,  and  diagonaliza¬ 
tion/transformation  steps.  Timings  for  1000  evaluations  of  the  A  •  x  and  A1  •  y  steps  and  10  calls 
to  the  diagonalization/transformation  step  are  given  in  seconds,  with  scaling  factors  in  parentheses. 
These  computations  were  performed  on  an  Intel  Xeon  E5-2690  system  (2  processors  with  8  cores 
each).  To  study  the  strong  scaling  properties  of  the  algorithm,  we  considered  a  system  with  192 
orbitals.  Timings  and  scaling  factors  are  provided  in  Table  II.  At  present,  the  parallel  speedup 
for  the  A  •  x  and  A1  •  y  steps  does  not  exceed  2-2.5  even  at  full  CPU  load.  The  theoretical  peak 
performance  for  memory-bound  algorithms  on  this  computer  system  would  yield  a  4-6x  speedup. 

The  weak  scaling  properties  of  the  steps  of  the  algorithm  are  shown  in  Fig.  4.  For  each  increase 
in  problem  size,  a  proportional  increase  in  the  number  of  computing  processors  was  used.  With 
perfect  parallel  scaling,  the  polynomial  degree  of  the  cost  of  each  step  would  be  reduced  by  one.  We 
observe  perfect  scaling  for  the  diagonalization/transformation  step,  a  CPU  bound  problem.  The 
A  •  x  and  Ar  •  y  steps  scale  poorly  as  they  are  memory-bound  problems.  However,  for  all  problem 
sizes  considered,  except  for  the  smallest,  the  diagonalization/transformation  step  dominates  the 
entire  computation  and  therefore  is  the  best  candidate  for  parallelization. 


5 


Firm:  Q-Chem,  Inc. 


Topic:  A14A-T013 


Proposal:  A14A-013-0061 


TABLE  II:  Parallel  scaling  of  the  A  •  x,  A1  •  y  and  diagonalization/transformation  steps.  Timings 
for  1000  iterations  of  A  ■  x  and  A1  •  y  steps  and  10  iterations  of  diagonalization/transformation  are 
given  in  seconds,  scaling  factors  in  parentheses.  These  computations  were  performed  on  an  Intel  Xeon 

E5-2690  system  (2  processors  with  8  cores  each). 


processors 

Ax 

AT-y 

i 

366 

191 

2 

223  (1.6) 

143  (1.3) 

4 

173  (2.1) 

127  (1.5) 

8 

144  (2.5) 

115  (1.7) 

16 

138  (2.7) 

115  (1.7) 

FIG.  4:  Weak  scaling  properties  of  the  A  ■  x  (blue),  A1  •  y  (red)  and  diagonalization/transformation 
(green)  steps  in  the  libtensor  implementation  of  the  V-2RDM  method. 


no.  basis  functions 

FIG.  5:  GPU  and  CPU  timings  (s)  for  the  diag¬ 
onalization/transformation  step  of  the  boundary- 
point  algorithm. 


Our  pilot  distributed-memory  implementation  is 
based  on  libtensor’s  use  of  the  Cyclops  Tensor  Frame¬ 
work  (CTF),8  a  distributed  tensor  algebra  library, 
as  a  back-end.  While  we  succeeded  in  developing 
a  functioning  distributed-memory  parallel  algorithm 
using  CTF,  our  preliminary  results  show  that  the 
current  version  of  the  code  does  not  achieve  strong  or 
weak  scaling  in  the  A-x  and  A1  -y  steps.  The  cause 
of  this  failure  is  to  be  investigated  further  during 
Phase  II,  and  possible  reasons  include  inefficiencies 
in  interfacing  libtensor  with  CTF  and  inefficiencies 
within  CTF,  neither  of  which  were  specifically  opti¬ 
mized  for  this  use  case.  Nonetheless,  in  the  regime 
of  large  problems,  which  is  the  target  of  the  dis¬ 
tributed  memory  implementation,  the  diagonaliza- 
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tion/transformation  step  of  the  algorithm  strongly 
dominates  the  total  walltime.  This  step  will  be  more  easily  parallelizable,  and  during  Phase  II 
we  will  use  one  of  the  very  efficient  existing  distributed  parallel  eigensolvers  (e.g.  ScaLAPACK, 
PLAPACK,  Elemental)  for  this  step. 

We  have  also  explored  the  possibility  of  accelerating  the  diagonalization/transformation  kernel 
through  the  use  of  graphics  processing  units  (GPUs),  which  are  automatically  detected  and  utilized 
by  our  software  (if  the  software  is  configured  with  a  “cuda”  option).  Figure  5  provides  the  time 
required  to  evaluate  the  diagonalization/transformation  step  10  times  for  systems  of  varying  sizes. 
This  step  is  evaluated  using  either  a  single  core  of  an  Intel  Core  i7  3930k  CPU  or  an  NVIDIA 
Tesla  K40c  (Kepler)  GPU.  The  GPU  is  utilized  through  the  culaDsyev  call  of  the  CULA  linear 
algebra  library  (for  the  diagonalization)  and  the  cublasDgemm  call  of  the  NVIDIA  CUDA  Basic 
Linear  Algebra  Subroutine  (cuBLAS)  library  (for  the  transformation).  For  the  systems  considered, 
the  GPU  is  as  much  as  3.5  times  more  efficient  than  the  CPU.  This  comparison  is  not  completely 
fair  in  that  the  CPU  code  is  utilizing  only  one  of  six  physical  cores.  However,  it  does  demonstrate 
the  ability  of  the  software  to  detect  and  utilize  GPUs  to  accelerate  this  portion  of  the  algorithm. 
Further,  these  computations  consider  two-particle  A-representability  conditions,  and  dimensions 
of  the  matrices  being  diagonalized  are  not  large  enough  to  fully  exploit  the  GPU.  As  is  discussed 
below,  we  have  also  implemented  a  set  of  partial  three-particle  A-representability  conditions5  for 
which  this  step  would  involve  the  diagonalization  and  transformation  of  much  larger  matrices.  We 
will  investigate  this  case  in  Phase  II,  and  we  expect  the  utility  of  GPUs  to  become  more  apparent. 

IV.  ADDITIONAL  MILESTONES 

To  this  point,  we  have  only  considered  spin- 
adapted,  closed-shell  implementations  of  the  two- 
particle  A-representability  conditions  known  as  the 
DQG  (or  PQG)  conditions.  We  have  also  general¬ 
ized  our  hand-written  (non-libtensor)  code  to  per¬ 
form  open-shell  optimizations.  Unlike  the  closed- 
shell  implementation,  the  open-shell  implementation 
does  not  use  spin-adapted  basis  functions.  Upon  in¬ 
spection  of  the  spin  block  structure  of  each  of  the 
representations  of  the  2-RDM  (particularly  the  2G 
matrix)  for  non-singlet  states,  it  becomes  clear  that 
spin  adaptation  is  not  does  not  yield  much  compu¬ 
tational  savings  for  open-shells,18  and  a  much  sim¬ 
pler  algorithm  can  be  designed  without  the  use  of 
spin-adapted  basis  functions.  One  must  then  explic¬ 
itly  constrain  the  expectation  value  of  S2  in  order  to 
maintain  a  pure  spin  state,16  but  this  constraint  is 
quite  easily  enforced. 

We  recently  applied  our  open-shell  variational  2-RDM  code  to  the  evaluation  of  singlet/triplet 
gaps  in  the  linear  acene  molecules.19  We  have  explored  the  performance  of  the  V-2RDM  method 
with  the  DQG  two-particle  A-representability  conditions  for  molecules  as  large  as  to  12-acene 
(dodecacene)  represented  by  the  STO-3G  basis  set.  Our  computations  can  be  thought  of  as  a 
complete-active-space  (CAS)  calculation  where  only  the  orbitals  of  the  7r-network  are  considered  as 
active  (the  occupations  of  all  other  occupied  and  virtual  orbitals  are  held  fixed  at  their  Hartree-Fock 
values).  The  largest  computation  (12-acene)  is  comparable  to  a  complete  active  space  computation 


(n)-acene 

FIG.  6:  Singlet /triplet  gaps  (kcal  mol-1)  for 
acene  molecules  of  increasing  length.  DMRG, 
UBLYP,  and  UB3LYP  results  results  are  taken 
from  Ref.  17. 
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with  50  electrons  in  50  orbitals.  This  system  size  is  clearly  beyond  what  can  be  treated  using 
conventional  configuration-interaction-based  CAS  methods.  Figure  6  illustrates  the  singlet/triplet 
gap  computed  at  the  variational  2- RDM  (DQG),  density  matrix  renormalization  group  (DMRG), 
and  density  functional  theory  (DFT)  levels  of  theory.  The  DFT  computations  use  the  B3LYP 
and  BLYP  functionals  and  a  6-31g(d)  basis  set.  DMRG  and  DFT  results  are  taken  from  Ref.  17. 


natural  orbital  number 


FIG.  7:  Natural  orbital  occupations  for  the  ground 
state  of  linear  polyacenes  obtained  from  the  varia¬ 
tional  2-RDM  (DQG)  level  of  theory. 


We  see  that  the  2-positivity  conditions  (DQG) 
yield  singlet /triplet  gaps  in  these  systems  that 
are  qualitatively  similar  to  those  obtained  from 
DMRG  computations.  Both  DMRG  and  varia¬ 
tional  2RDM  computations  suggest  that  the  sin¬ 
glet/triplet  gap  decreases  monotonically  with  in¬ 
creasing  system  size  and  should  converge  to  some 
finite  value.  This  behavior  differs  from  the  DFT 
methods  that  suggest  the  singlet/triplet  gap  actu¬ 
ally  increases  between  10-  and  12-acene.  To  our 
knowledge,  these  data  represent  the  largest 
direct  comparison  of  the  quality  of  varia¬ 
tional  2-RDM  methods  to  other  methods 
for  a  chemically  meaningful  open-shell  sys¬ 
tems.  We  also  present  in  Fig.  7  the  natural 
orbital  occupations  for  the  singlet  state  of  each 
molecule  in  the  linear  acene  series.  We  observe 

the  same  emergence  of  polyradical  behavior  reported  in  Ref.  17  at  the  DMRG  level  of  theory,  but 
we  note  that  the  V-2RDM  method  tends  to  over  correlate  the  electrons.  This  overcorrelation  man¬ 
ifests  itself  in  a  greater  degree  of  occupation  degeneracy  between  the  highest  occupied  and  lowest 
unoccupied  natural  orbitals. 

While  singlet /triplet  gaps  from  the  V-2RDM  method  do  display  excellent  qualitative  agreement 
with  those  obtained  from  DRMG,  the  absolute  deviations  from  the  DMRG  values  can  be  as  large  as 
2  kcal  mol-1.  Fortunately,  the  V-2RDM  method  is  systematically  improvable  through  the  enforce¬ 
ment  of  additional  N-representability  conditions.  Accordingly,  we  recently  implemented  stronger 
partial  three-particle  IV-representability  conditions  known  as  the  T1  and  T2  conditions.'5  Figure  8 
illustrates  the  error  (in  milihartrees,  mH)  for  various  IV-representability  conditions  relative  to  full 
Cl  computations  for  the  N2  dissociation  using  the  STO-6G  basis  set.  Clearly  the  T2  condition 
greatly  improves  the  quality  of  the  potential  energy  curve.  The  error  near  the  equilibrium  bond 
length  is  roughly  1  mH,  and  the  largest  errors  over  the  curve  are  only  about  5  mH.  We  have  only 
implemented  these  conditions  in  our  plugin  to  Psi4,  but  interfacing  the  code  with  Q-Chem  libraries 
should  be  straightforward  at  this  point.  Our  Phase  II  proposal  will  include  efforts  to  implement 
these  T1  and  T2  conditions  in  an  efficient  way  using  libtensor.  Note  that  these  conditions  are 
currently  expressed  using  non-spin-adapted  basis  functions  and  are  generalized  to  treat  both  open- 
and  closed-shell  systems. 


V.  CONCLUSIONS 

Over  the  past  six  months,  we  have  achieved  every  milestone  outlined  in  our  Phase  I  proposal.  We 
successfully  implemented  a  boundary-point  semidefinite  solver  in  Month  1  that  was  demonstrated 
to  be  15-1360  times  more  efficient  than  our  initial  matrix-factorization-based  algorithm.  It  is  clear 
that  the  boundary-point  solver  is  the  solver  of  choice  for  moving  forward  with  this  STTR.  In  Month 
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N-N  bond  length  (A) 


FIG.  8:  Errors  in  the  electronic  energy  (mH)  obtained  from  variational  2-RDM  computations  with 
various  positivity  conditions  throughout  the  N2  dissociation  curve.  The  basis  set  is  STO-6G,  and  the 
errors  are  determined  relative  to  full  configuration  interaction  computations. 


2,  we  interfaced  our  boundary-point  solver  with  the  low-level  libraries  of  the  Q-Chem  electronic 
structure  package.  In  Month  3,  we  reimplemented  the  A  ■  x  kernel  of  the  semidefinite  solver  using 
the  libtensor  library  and  demonstrated  that  the  libtensor  solver  could  be  substantially  more  efficient 
than  the  hand-written  solver  for  large  system  sizes.  In  Month  4,  as  we  continued  reimplementing 
the  boundary-point  solver,  we  implemented  hand-optimized  open-shell  semidefinite  solvers.  As 
was  demonstrated  above,  the  variational  2-RDM  method  with  the  DQG  conditions  can  be  used 
to  obtain  a  qualitatively  correct  description  of  the  decrease  in  the  singlet /triplet  gap  in  linear 
acene  molecules  of  increasing  size.  In  month  5,  we  finished  reimplementing  a  pilot  version  of  the 
semidefinite  solver  in  Q-Chem  using  the  libtensor  library.  Again,  for  large  systems,  the  evaluation 
of  the  A  ■  x  and  AT  •  y  kernels  is  more  efficient  when  using  the  libtensor  library.  We  spent 
Month  6  optimizing  the  libtensor  code  and  enabling  point-group  symmetry  in  that  algorithm  (we 
should  note  that  all  of  our  algorithms  exploit  abelian  point-group  symmetry).  We  have  presented 
preliminary  scaling  data  for  both  shared-memory  parallelism  and  have  developed  a  pilot  distributed- 
memory  parallel  code  using  a  combination  of  libtensor  and  CTF.  We  have  also  demonstrated  that  our 
software  can  utilize  graphics  processing  units  to  accelerate  at  least  one  portion  of  the  boundary- 
point  semidefinite  programming  solver.  We  have  also  implemented  two  partial  three-positivity 
conditions  that  substantially  increase  the  quality  of  the  V-2RDM  method.  Part  of  Phase  II  would 
involve  reimplementing  these  conditions  using  libtensor. 
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